ABSTRACT

Previous studies of audiovisual (AV) speech 
integration have used behavioral methods to 
examine perception of congruent and incongruent 
AV speech stimuli. Such studies have investigated 
responses to a relatively limited set of the possible 
incongruent combinations of AV speech stimuli. A 
central issue for examining a wider range of 
incongruent AV speech stimuli is developing a 
systematic method for alignment that will work with 
a wide variety of segments. In the present study, we 
investigated the use of three different landmarks 
(consonant-onset, vowel-onset, and minimum 
distance) for aligning incongruent AV stimuli. 
Acoustic /ba/ or /la/ syllables were dubbed onto 
eight visual Consonant-/a/ syllables that spanned 
different places and manners of articulation. The 
AV stimuli were presented to ten participants. 
Results indicated that the effect of alignment 
landmark was not significant. The distance 
measures were found to be related to visual 
influence. Acoustic /ba/ tokens were more 
influenced by visual stimuli than acoustic /la/ 
tokens. Visual influence on the acoustic /ba/ tokens
was mainly of the McGurk type and/or of voicing
confusion, while visual influence on the acoustic /la/
tokens was mainly of the combination type (/ba/ +
/la/ = /bla/).

1. INTRODUCTION 

Humans typically perceive and integrate information 
from multiple sensory channels [1; 2; 6; 10]. One of 
the most significant examples of multisensory 
integration is audiovisual (AV) speech perception 
(e.g., the McGurk effect [8] and AV speech 
perception in noisy acoustic conditions [11]). 

Congruent and incongruent AV speech stimuli have 
been used widely both in behavioral studies [3; 7; 9] 
and fMRI (or electrophysiological) studies [12; 13] 
with relatively little investigation into the nature of 
the physical stimuli being combined. These
congruent and incongruent stimuli have elicited
various behavioral and brain activation patterns, but
interpretation of these results is limited by our
understanding of the physical stimuli.

Typical incongruent AV stimuli were AV speech 
signals with different temporal alignments [3; 7; 9], 
McGurk-style stimuli [12; 13], or an /iri/-/ili/ 
acoustic continuum plus visual /ibi/ [4]. However, 
these studies did not quantify the degree of 
incongruity between auditory and visual speech 
signals. Although the mismatched AV stimuli based 
on the different levels of synchrony resulted in 
graded levels of perceptual responses [3; 7; 9], the 
synchrony is only one of the factors that contribute 
to incongruity, and the acoustic and optical stimulus 
attributes should be taken into account. Given that 
quantitative results can be obtained from behavioral, 
neuroanatomical, and electrophysiological studies 
[12; 13], it is desirable to use mismatched AV 
stimuli with different quantified levels of 
incongruity to compare with the dependent 
measures in experiments. 

The quantitative examination of AV speech stimulus 
incongruity is a difficult task: It is not known yet 
which parts of the signals perceivers are sensitive to 
in response to AV stimulus incongruity, and at 
which cortical level the AV speech signals are 
bound. Currently there is no consensus in the 
literature regarding how to quantify the perceptual 
incongruity between auditory and visual speech 
signals. In the present study, we proposed a novel 
method for aligning congruent and incongruent AV 
stimuli with Consonant-/a/ (/Ca/) syllables and for 
quantifying incongruity. 

Incongruent AV speech stimuli were created by
mismatching an acoustic consonant (/b/ or /l/) with
visual consonants that spanned different places and
manners of articulation (/b, d, g, v, z, l, w, ð/). In the
literature, synchronous AV speech signals are
typically aligned based on consonant onsets
(acoustic bursts) [3; 7; 9]. In the present study,
consonants of different durations were used (e.g.,
auditory /ba/ versus visual /va/), so consonant-onset
alignment is not the only plausible choice. To
examine the effect of crossmodal alignment, three
alignments were used: (1) consonant-onset-based
(C-), (2) vowel-onset-based (V-), and (3)
minimum-AV-distance-based (M-).

For the current study, AV incongruity was modeled
as the Euclidean distance between the acoustic
signal and what will be referred to here as the
phantom acoustic signal (i.e., the original acoustic
signal from the visual speech token). Acoustic
features were Line Spectral Pair parameters (LSPs)
[see 5]. Because the vowel was held constant as /a/,
the Euclidean distances were focused on consonants
(including portions of the coarticulation; see Figure
1). Given an auditory /C1a/ (A/C1a/) and a visual /C2a/
(V/C2a/), the distance between the auditory and visual
(i.e., phantom auditory) speech stimuli was
computed as:

$$ d_{AV}(T_d) = \left\| \mathrm{LSP}_A^{[S_1,\ S_1 + L_A]} - \mathrm{LSP}_V^{[S_2 + T_d,\ S_2 + T_d + L_A]} \right\|_2 , \qquad (1) $$

where $\mathrm{LSP}_A$ and $\mathrm{LSP}_V$ were from A/C1a/ and V/C2a/,
respectively, and the superscripts give the span of
frames that enters the distance. $S_1$ and $S_2$ represent
the consonant onset points for A/C1a/ and V/C2a/,
respectively. $L_A$ approximated the consonant
duration in A/C1a/. $T_d$ represents the temporal shift
of A/C1a/ across V/C2a/. For the C-alignment, the
phantom acoustic segment began at the consonant
onset point, and thus the $T_d$ value was 0. For the
V-alignment, the phantom acoustic segment ended at
the vowel onset point, and thus the $T_d$ value was
$L_V - L_A$. The M-alignment was obtained by sliding
the acoustic signal relative to the video and finding
the point of minimal distance (see Figure 1). An
implication of Equation 1 is that $L_A$ has an effect on
the distances: Acoustic /ba, da, ga/ tokens in general
yield small intra-token distances. This method was
our initial effort toward quantifying incongruity
between auditory and visual speech signals, without
implying any particular perceptual or neural
representations.
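
The three alignments described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy one-dimensional "LSP" frames, and especially the search range for the M-alignment (every shift that keeps the segment inside the phantom signal) are assumptions made for illustration, since the text does not state how far the acoustic signal was slid.

```python
import math


def segment_distance(lsp_a, lsp_v, s1, s2, l_a, t_d):
    """Equation 1 for one shift t_d; all positions and lengths are in frames."""
    total = 0.0
    for k in range(l_a):
        frame_a = lsp_a[s1 + k]          # acoustic consonant frame
        frame_v = lsp_v[s2 + t_d + k]    # shifted phantom-acoustic frame
        total += sum((x - y) ** 2 for x, y in zip(frame_a, frame_v))
    return math.sqrt(total)


def alignment_shifts(lsp_a, lsp_v, s1, s2, l_a, l_v):
    """Return the shift T_d (in frames) for the C-, V-, and M-alignments."""
    c_shift = 0            # phantom segment starts at the consonant onset
    v_shift = l_v - l_a    # phantom segment ends at the vowel onset
    # Assumption: search every shift that keeps the segment inside lsp_v.
    first = -s2
    last = len(lsp_v) - l_a - s2
    m_shift = min(
        range(first, last + 1),
        key=lambda t: segment_distance(lsp_a, lsp_v, s1, s2, l_a, t),
    )
    return {"C": c_shift, "V": v_shift, "M": m_shift}


# Toy data (hypothetical): one feature per frame instead of 16th-order LSPs.
lsp_a = [[0.0], [1.0], [2.0], [9.0]]
lsp_v = [[5.0], [5.0], [1.0], [2.0], [5.0], [5.0], [5.0]]
shifts = alignment_shifts(lsp_a, lsp_v, s1=1, s2=1, l_a=2, l_v=4)
print(shifts)  # {'C': 0, 'V': 2, 'M': 1}
```

In this toy case the acoustic consonant segment matches the phantom signal exactly one frame after the visual consonant onset, so the M-alignment falls between the C- and V-alignments.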

To assess perceptual consequences of the three 
alignment methods, normal hearing perceivers 
identified the stimuli in an open-set identification 
task. To examine the possibility that the large
difference between A/ba/ and A/la/ might draw
attention to the audio and defeat visual influence
effects, half of the participants viewed A/ba/V and
A/la/V in blocked presentations (blocked design), and
the other half viewed mixed A/ba/V and A/la/V
presentations (mixed design). Behavioral results
were examined in terms of response types, auditory- 
based accuracy, visual-based accuracy, and some 
other contributing factors (e.g., talker differences). 



2. METHOD 

2.1 Talkers 

The talkers (with American English as a native 
language) were selected from a larger pool that had 
been initially screened for their visual intelligibility, 
as judged by deaf adults. Subsequent extensive
visual-only speech perception testing with 16
normal-hearing participants, using 320 sentences
produced by each of the four talkers, showed that
F2 was the most intelligible, then M1, followed by
M2 and F1. These results were replicated with eight
deaf lipreaders, except that M2 was more
intelligible than M1 [5].

2.2 Participants 

Participants were ten adults (age 19-29 years, mean 
age 22 years; five females) with normal hearing, 
American English as a native language, and normal 
or corrected-to-normal vision. All were screened for 
their lipreading ability, but their scores were not 
used to bar entrance to the experiment, only to 
provide insight into the results. Testing was 
approved by an Institutional Review Board. 
Participants gave informed consent and were paid 
$10 per hour. 

2.3 Speech Materials 

The corpus was part of a larger database [5]. The 
original database obtained with the four selected 
talkers consisted of 69 consonant-vowel syllables. 
Each syllable was produced at least two times in a 
pseudo-randomly ordered list. For the present study,
eight syllables /ba, da, ga, va, za, la, wa, ða/ were
included. These voiced consonants were chosen
because they vary in place and manner of
articulation. Two tokens (labeled '1' and '2') of
each syllable were selected for the present study
[see 5].

2.4 Data Recording 

The recording system comprised a production 
quality SONY video camera, a SONY recorder, a 
Qualisys 3-D three-camera motion capture system, a 
DAT recorder, and a directional Sennheiser 
microphone [5]. Lighting and positioning were 
carefully adjusted to obtain clear realistic 
recordings. The talkers looked directly into the 
camera, and their faces filled the monitor. The 
microphone was positioned to be out of the way of 
video. All of the recorded data streams were 
synchronized [5]. The audio sampling frequency 
was 48 kHz for the video recordings. The optical
data and the audio from the DAT recordings were
not used for the present study.

2.5 Consonant and Vowel Onsets Labeling 

The C- and V-alignments of AV speech signals were
achieved by hand labeling the consonant and vowel
onsets. The acoustic features used for the labeling
were transient, frication, aspiration, voicing, and
high-frequency attenuation. Five consonants (/v, z, l,
w, ð/) did not have bursts, so their consonant onset
was instead defined as the consonant release point,
which was the beginning of either vocal fold
vibration or fricative noise.

Given the coarticulation effect, the vowel onset in
/Ca/ was manually determined based on its spectrum
and waveform. The consonants /b, d, g, v, z, ð/ had
easily defined "boundaries" between consonant and
vowel, namely the first vocal fold vibration after
the aspiration noise. However, for /w, l/, a boundary
was more difficult to define, and the high-frequency
spectrum and waveform properties were combined
to locate the vowel onsets. Two examples of
consonant and vowel onset labeling are given in
Figure 2. Two utterances (AW and /ga2/ from
Talker M2) had a voicebar, that is, a low-frequency
hum. After labeling, non-speech sounds such as lip
smacking and preceding voicebars were deleted (set
to zero).

Figure 3 displays the consonant and vowel
durations for /Ca/ syllables. The mean consonant
and vowel durations, respectively, were 102 ms and
439 ms for Talker M1, 76 ms and 319 ms for Talker
F1, 80 ms and 388 ms for Talker M2, and 79 ms and
374 ms for Talker F2. Talker F1 produced shorter
vowels than the other talkers. The /ba, da, ga/
syllables had short consonant durations. Consonant
durations in /wa, ða/ differed across talkers.

2.6 Generating AV Speech Stimuli 

2.6.1 Digitizing Video Tapes 

The video recordings on BETACAM tapes were 
digitized using an ACCOM real-time digital disk 
recorder. Uncompressed video images (720x486)
were transferred to a PC as individual frame files. 
The corresponding acoustic tokens (48 kHz) were 
also transferred to the PC as individual files. These 
sounds were normalized (based on average RMS 
levels derived from A-weighted spectra). 

2.6.2 AV Pairing, Synchrony, and Distance 

For each talker, AV stimuli were generated by
dubbing A/ba1/ and A/la1/ onto V/Ca2/ tokens, and by
dubbing A/ba2/ and A/la2/ onto V/Ca1/ tokens.
Therefore, every stimulus involved dubbing, even
when the syllables matched (e.g., V/ba1/ and A/ba2/).

In addition, each dubbing was achieved with C-, V-,
and M-alignments that were derived using Equation
1. For this purpose, acoustic signals from the video
recordings, originally sampled at 48 kHz, were
down-sampled to 16 kHz. Speech signals were then
divided into frames. The frame length and shift
were 24 ms and 2.8 ms, respectively. For each
acoustic frame, 16th-order LSPs [including the
log(Energy)] were obtained [see 5]. In total, 384
stimuli were generated (8 V/Ca/ x 2 A/Ca/ x 2
tokens x 3 alignments x 4 talkers). Figure 4 and
Figure 5 display the alignments and the
corresponding AV distances. The distances were
smaller for A/ba/ than for A/la/. The C- and
V-alignments were different when the auditory and
visual consonants had different durations. For
example, in Figure 4, the C-, M-, and V-alignments
of V/za/ (the 5th cluster) with A/ba2/ of Talker F2 (the
8th row) were different (i.e., having different
vertical positions), and they produced different AV
distances (i.e., different bar widths).
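
The stimulus count can be checked by enumerating the design, as in the sketch below. The dictionary layout and the talker labels in the list are illustrative assumptions; the cross-token pairing (audio token 1 with video token 2, and vice versa) follows the dubbing scheme described in this section.

```python
from collections import Counter
from itertools import product

TALKERS = ["M1", "F1", "M2", "F2"]
VISUAL = ["ba", "da", "ga", "va", "za", "la", "wa", "ða"]
AUDIO = ["ba", "la"]
PAIRINGS = [(1, 2), (2, 1)]      # (audio token, video token): always cross-token
ALIGNMENTS = ["C", "V", "M"]


def build_stimuli():
    """Enumerate every talker x audio x video x pairing x alignment combination."""
    return [
        {
            "talker": talker,
            "audio": audio,
            "audio_token": a_tok,
            "video": video,
            "video_token": v_tok,
            "alignment": alignment,
        }
        for talker, audio, video, (a_tok, v_tok), alignment in product(
            TALKERS, AUDIO, VISUAL, PAIRINGS, ALIGNMENTS
        )
    ]


stimuli = build_stimuli()
per_talker = Counter(s["talker"] for s in stimuli)
print(len(stimuli))      # 384
print(per_talker["F2"])  # 96
```

Because the audio and video tokens always carry different labels, even nominally congruent pairs such as A/ba/V/ba/ are dubbed rather than taken from a single recording.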

2.6.3 AV Encoding 

For the video images, the top and bottom three lines 
were cropped, and the sequence of uncompressed 
frames for each stimulus was built into an AVI file 
that was compressed using the LIGOS LSX MPEG- 
Compressor. The resulting video clips had an image 
size of 720x480, a frame rate of 29.97 Hz, and a 
constant bitrate of 7700 Kbits/sec. These video clips 
were concatenated to create a single large video file 
that was authored to a DVD using the SONIC 
ReelDVD. As with the video, all of the audio files 
were concatenated into a single long file for 
production of the DVD. Audio concatenation was 
performed using custom software that ensured frame 
locked audio of 8008 samples per 5 video frames. 
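
The figure of 8008 samples per 5 video frames can be verified with exact arithmetic. The sketch below assumes that the 29.97 Hz frame rate is the NTSC rate of 30000/1001 Hz and that the DVD audio remained at 48 kHz; neither detail is stated explicitly here, and the function name is illustrative.

```python
from fractions import Fraction

AUDIO_RATE_HZ = 48_000
VIDEO_RATE_HZ = Fraction(30_000, 1_001)  # assumed NTSC rate, about 29.97 Hz


def samples_for_frames(n_frames):
    """Exact number of audio samples spanned by n_frames video frames."""
    return AUDIO_RATE_HZ * n_frames / VIDEO_RATE_HZ


print(samples_for_frames(1))  # 8008/5, i.e. 1601.6 samples: not a whole number
print(samples_for_frames(5))  # 8008
```

A single video frame spans a fractional number of samples, so five frames is the shortest run over which audio and video boundaries coincide. Locking the audio to this 5-frame unit keeps concatenated clips from drifting out of synchrony.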

2.7 Procedure 

AV stimuli were presented using a Pioneer DVD 
player and were displayed on a 14" SONY Trinitron 
monitor at a distance of about one meter from the 
participant. Audio was presented over calibrated 
TDH-49 headphones at a level of 65 dB SPL that 
was checked before and after each session. 

Participants performed an open-set identification 
task by typing their responses using a computer 
keyboard. Participants were instructed to watch and 
listen to the talkers, and then identify the consonant 
or consonants that they heard. Participants were
directed to guess if necessary. An experimenter 
monitored the participants during testing. 
At the beginning of each session, instructions were
displayed on a PC monitor. After the participant
acknowledged having read the instructions, a
computer program presented each of the stimuli and
recorded behavioral responses. Following each stimulus, a
black filled frame was displayed on the video 
monitor, and an input box was displayed on the PC 
monitor. After typing and double-checking the 
response, participants pressed the "ENTER" key to 
switch to the presentation of the next token. 
Participants were instructed to report any mistyping 
during breaks. No feedback was given at any time. 

The presentation of the 384 tokens in each session
was administered under one of two experimental designs.

A/ba/V and A/la/V mixed design (mixed design).

The 384 tokens were blocked by talker, and thus
there were four blocks. Each block consisted of 96
tokens, which comprised both A/ba/V and A/la/V.
Each block took about 10 minutes.

A/ba/V and A/la/V blocked design (blocked design).

The 384 tokens were first blocked by talker and
then by auditory type (A/ba/ or A/la/), and thus there
were eight blocks. Each block consisted of 48
tokens of A/ba/V only or A/la/V only from one talker.
Each block took about 5 minutes.

Half of the participants were tested on the mixed 
design, and the other half on the blocked design. 
The talker order was assigned randomly in each 
session. For the blocked design, the order of the 
A/ba/V and A/la/V blocks was randomized within each
talker. Within each block, the tokens were randomly 
ordered. 
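
The two presentation designs can be sketched as follows. This is an illustration of the randomization described above, not the software used in the experiment; the function names and the simplified stimulus records (48 anonymous items per talker and auditory type) are assumptions.

```python
import random


def make_stimuli():
    """Hypothetical stand-in for the 384 tokens: 4 talkers x 2 audio types x 48."""
    return [
        {"talker": talker, "audio": audio, "item": item}
        for talker in ("M1", "F1", "M2", "F2")
        for audio in ("ba", "la")
        for item in range(48)
    ]


def session_order(stimuli, design, rng):
    """Return one session as a list of blocks, each a shuffled list of tokens."""
    talkers = sorted({s["talker"] for s in stimuli})
    rng.shuffle(talkers)                      # random talker order per session
    blocks = []
    for talker in talkers:
        tokens = [s for s in stimuli if s["talker"] == talker]
        if design == "mixed":
            groups = [tokens]                 # A/ba/V and A/la/V together
        elif design == "blocked":
            audio_types = sorted({s["audio"] for s in tokens})
            rng.shuffle(audio_types)          # random A/ba/V vs A/la/V order
            groups = [[s for s in tokens if s["audio"] == a] for a in audio_types]
        else:
            raise ValueError(f"unknown design: {design}")
        for group in groups:
            block = list(group)
            rng.shuffle(block)                # random token order within a block
            blocks.append(block)
    return blocks


rng = random.Random(0)
stimuli = make_stimuli()
mixed = session_order(stimuli, "mixed", rng)
blocked = session_order(stimuli, "blocked", rng)
print(len(mixed), len(blocked))  # 4 8
```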

Participants completed one session per day for a
total of ten sessions, so each participant contributed
ten responses to each stimulus token. A
five-minute break was given between blocks to
prevent fatigue. Instruction on phonemic labeling
and a practice set of 16 trials were given on Day 1.

3. RESULTS 

Figure 6 shows the frequencies of the majority of
responses to A/ba/V and A/la/V. For A/ba/V, the
responses were typically individual consonants.
Among the 23 consonants /y, w, r, l, m, n, p, t, k, b,
d, g, h, θ, ð, s, z, f, v, ʃ, ʒ, tʃ, dʒ/, six consonants
/ʃ, ʒ, tʃ, dʒ, y, h/ were not reported (individually or
in combination). For A/la/V, there were many
combination responses (e.g., /bl/). These
combination responses were not symmetric. That is,
there were no /lb/ responses for A/la/V/ba/, although
their C-, V-, and M-alignments were different. At
the completion of the experiment, some participants
reported having noticed mismatches between
auditory and visual stimuli.¹

The results were tallied in terms of auditory-based
accuracy (AA; e.g., the response to A/ba/V/wa/ was
"ba"), visual-based accuracy (VA; e.g., the response
to A/ba/V/va/ was "va"), voiceless responses (VL), and
combinations (CO; e.g., "bla"). To determine which,
if any, main factors were significant, each measure
was submitted to an omnibus mixed-measures
analysis of variance, with video (7; excluding
A/ba/V/ba/ or A/la/V/la/), talker (4), pairing (2), and
alignment method (3) as the repeated factors. The
stimulus presentation design (blocked versus mixed
design) was the between-subject factor. A/ba/V and
A/la/V were analyzed separately.
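
The tallying of response types can be sketched as below. The precedence of the categories, the representation of a response as a tuple of consonant symbols, the voiceless set, and the catch-all "other" label are assumptions made for illustration; the text defines the four measures only by example.

```python
from collections import Counter

# Assumed set of voiceless consonants that a participant might report.
VOICELESS = {"p", "t", "k", "f", "s", "θ", "ʃ", "tʃ", "h"}


def score_response(audio_c, visual_c, response):
    """Classify one response (a tuple of consonant symbols) as AA, VA, VL, CO, or other."""
    if len(response) > 1:
        return "CO"                      # combination, e.g. ("b", "l")
    if response == (audio_c,):
        return "AA"                      # auditory-based accuracy
    if response == (visual_c,):
        return "VA"                      # visual-based accuracy
    if len(response) == 1 and response[0] in VOICELESS:
        return "VL"                      # voiceless response
    return "other"                       # anything else, e.g. a fusion response


# Hypothetical trials: (auditory consonant, visual consonant, typed response).
trials = [
    ("b", "w", ("b",)),
    ("b", "v", ("v",)),
    ("l", "b", ("b", "l")),
    ("b", "g", ("p",)),
    ("b", "g", ("d",)),
]
tally = Counter(score_response(a, v, r) for a, v, r in trials)
print(tally["AA"], tally["VA"], tally["CO"], tally["VL"], tally["other"])  # 1 1 1 1 1
```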

F-test values are listed in Table 1 for all ANOVAs.
The results showed that pooling across alignment
and presentation design was permissible. In
addition, the pairing effect (see Section 2.6.2) was
mainly due to an artifact in a /la/ token spoken by
Talker F2. Therefore, the AA, VA, VL, and CO
measures examined below (see Figure 7 and Figure
8) were pooled across the three alignments, two
pairings, and ten participants.


                                        A/ba/V                    A/la/V
Factor               (N1, N2)    AA    VA    VL    CO      AA    VA    VL    CO
Video                (6, 3)    22.4  23.2   2.8   1.7     5.6   5.2   1.2   2.2
Talker               (3, 6)    10.1   2.7   4.1   2.3    17.4   3.1   1.5  26.3
Pairing              (1, 8)     5.3    .0   1.9    .8    29.2    .0    .7  11.1
Alignment            (2, 7)      .3   1.6   2.2   2.0     1.4   3.2   1.2    .9
Presentation design  (1, 8)     1.1    .3    .0    .2      .3   1.0    .4    .3

Table 1. ANOVA results (F values) with different performance
measures and different factors. (N1, N2) represent degrees of
freedom. The shaded areas indicate significant effects (p < .05).

Figure 7 and Figure 8 showed that, in general,
A/ba/ produced more visually influenced responses
(i.e., fewer AA, more VA, and more VL responses)
than did A/la/. A/ba/ produced more visually dominant
responses (with V/va/, V/ða/, and V/da/) than did A/la/
(with V/va/). A/ba/ produced more voiceless responses
and fewer combination responses than did A/la/. The
scoring of the responses in terms of AA versus VA
versus CO implies that whenever the number of all
of these types of responses was low, the number of
McGurk responses was high. Across the two types
of audio stimuli, there were more McGurk responses
for A/ba/V (e.g., A/ba/V/ga/, A/ba/V/za/, and A/ba/V/la/).
Most of the A/la/V stimuli showed no visual
influence. But when there was a visual influence, it
was most likely the combination type.

¹ The reported mismatches were A/ba/V/wa/ (4 participants), AwV/a/ (1),
A/la/V/ma/ (1), A/ba/V/e/ and A/la/V/w (1), A/la/V/lla/ and A/ma/V/b i a/ (1). One
participant rarely noticed any mismatch. Another participant noticed the
mismatches, but could not give an example.

Auditory-Visual Speech Processing 2005 (AVSP'05)

If Figure 4, which shows distances, is compared
with Figure 7, which shows responses for A/ba/V, an
overall pattern of relationships can be seen. In
particular, when the AV distance was small, many
responses were visually influenced (i.e., VA, VL,
CO). If Figure 5, which shows distances, is
compared with Figure 8, which shows responses for
A/la/V, a similar overall pattern can be seen. In
general, AV distances for A/ba/V (Figure 4) were also
smaller than those for A/la/V (Figure 5).

A more detailed examination of distance versus
visual influence yielded additional systematic
effects. For A/ba/, AV distances for Talker F2 were
smaller than those for other talkers, and Talker F2's
A/ba/V stimuli yielded more visual influence than
those of other talkers [Figure 7(a)]. For A/ba/V/da/,
AV distances for Talkers M2 and F2 were smaller
than those for other talkers, and their tokens yielded
more visual influence [Figure 7(b)]. Figure 8(a)
showed that the perception of A/la/ was affected by
V/ba/, V/va/, and V/wa/. For A/la/, AV distances for
Talker F2 were smaller than those for other talkers,
and Talker F2's A/la/V stimuli yielded more visual
influence than those of other talkers [Figure 8(a)].
In addition, the influence of V/va/ and V/wa/ on A/la/
agrees with their smaller AV distances (Figure 5).

Figure 7(c) and Figure 8(c) show voiceless
responses to the AV stimuli. A/ba/V stimuli of Talker
F1, who has the lowest visual intelligibility ratings,
yielded the largest number of voiceless responses. In
general, A/ba/V yielded more voiceless responses
than A/la/V. V/ba/ and V/wa/ produced fewer voiceless
responses for A/ba/ and A/la/ than the other six V/Ca/.

Figure 7(d) and Figure 8(d) show combination
responses to the AV stimuli. In general, A/la/ yielded
more combinations than A/ba/. A/la/ stimuli tended to
yield combination responses with V/ba/, V/va/, and
V/wa/. As mentioned earlier, the artifact in a /la/
token spoken by Talker F2 appeared to be
responsible for many combination responses.

4. SUMMARY AND DISCUSSION 

A/ba/V and A/la/V mixed and blocked designs were
not significantly different. Therefore, the attentional
effect was not a main effect in the experimental
design. Behavioral results indicated that the
alignment effect was not significant. However, C-,
V-, and M-alignments resulted in large differences
in distance measures using Equation 1. Thus,
although alignment was not a significant perceptual
factor in the current study, it is possible that other
response measures might be more sensitive and
produce alignment effects. The visual influence on
acoustic tokens varied as a function of syllable type.
A/ba/ was more influenced by visual stimuli than
A/la/. Visual influence on A/ba/ was of the McGurk
type and/or of voicing confusion, while visual
influence on A/la/ was of the combination type.

Figure 1. AV alignments and distances for A/ba1/V/za2/ (upper
panel) and A/la1/V/ba2/ (lower panel). The blue line between the
Audio and the phantom Audio represents the distance measure.
Lines with 'x' (black line), '+' (red line), and 'o' (green line)
represent C-, V-, and M-alignments, respectively.

Figure 2. Consonant and vowel onsets for /gaj/ (Talker M1).

Figure 3. Consonant and vowel durations (two tokens per /Ca/).

Figure 4. AV alignments and distances for A/ba/V. Each row
comprises data for one A/ba/ token from one talker. Each vertical
bar represents the proportions of time measured for the vowel
versus the consonant (see text). The lower part represents the
consonant segment, and the upper part represents the vowel
segment. The X axis labeling represents the different
alignments (C-, M-, and V-alignments) for V/Ca/ (V/ba/, V/da/,
V/ga/, V/va/, V/za/, V/la/, V/wa/, and V/ða/). The width in the X
direction of a bar indicates the magnitude of the distance.

Figure 5. AV alignments and distances for A/la/V.

Figure 6. Frequencies (X axis) of the various responses (X
axis) to A/ba/V and A/la/V. The infrequent (not more than 10
times for A/ba/ or A/la/ sound) response types are not listed.

Figure 7. The number (Y axis) of (a) auditory-based correct
responses, (b) visual-based correct responses, (c) voiceless
responses, and (d) combination responses with different visual
stimuli (X axis) for A/ba/V.

Figure 8. Behavioral responses for A/la/V (refer to Figure 7).